Kernel-based aggregation of marker-level genetic association tests 

involving copy-number variation 



Genetic association tests involving copy-number variants (CNVs) are complicated by the 
fact that CNVs span multiple markers at which measurements are taken. The power of an 
association test at a single marker is typically low, and it is desirable to pool information across 
the markers spanned by the CNV. However, CNV boundaries are not known in advance, and 
the best way to proceed with this pooling is unclear. In this article, we propose a kernel-based 
method for aggregation of marker-level tests and explore several aspects of its implementa- 
tion. In addition, wc explore some of the theoretical aspects of marker-level test aggregation, 
proposing a permutation-based approach that preserved the family- wise error rate of the testing 
procedure, while demonstrating that several simpler alternatives fail to do so. The empirical 
power of the approach is studied in a number of simulations constructed from real data involv- 
ing a pharmacogenomic study of gemcitabinc, and compares favorably with several competing 
approaches. 

1 Introduction 

The classical genetic model states that humans have two copies of each gene, one on each 
chromosome. The sequencing of individual human genomes, however, has revealed that there is an 
unexpected amount of structural variation present in our genetic makeup. Sections of DNA can 
be deleted or duplicated, leaving even healthy individuals with fewer or more copies of portions of 
their genome (such individuals are said to have a CNV, or copy- number variation, at that position). 
Understanding the contribution of CNVs to human variation is one of the most compelling current 
challenges in genetics (McCarroll, 2008). 

As the coverage of single nucleotide polymorphism (SNP) arrays has increased, it is increasingly 
possibly to use this data to infer the CNV status of individuals. Indeed, recent generations of such 
chips include probes specifically designed to enable measurement of copy number (McCarroll et al., 
2008; Perkel, 2008). Although other technologies for determining copy number exist, the use of 
genotyping chips has a clear advantage in that hundreds of genome-wide association studies have 
been performed (Hindorff et al., 2012), and these studies may be mined for CNV data at no added 
cost. Many of these studies have been carried out with very large sample sizes, thereby enabling 
CNV studies on a scale that would otherwise be prohibitively expensive (Consortium, 2007; Barnes 
et al., 2008). These factors have led to a number of studies using data from high-density genotyping 
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arrays to investigate the nature of copy- number variation and its role in human variation and disease 
(Redon et al., 2006; Komura et al., 2006; Simon-Sanchez et al., 2008; Cooper et al., 2011; Konishi 
et al., 2011). 

Two general strategies have been proposed for conducting genetic association studies of copy- 
number variation. The majority of analytic techniques attempt to (1) identify CNVs at the level of 
the individual, then (2) carry out association tests of whether individuals with a CNV differ from 
individuals without a CNV in regard to disease or some other phenotype. An alternative approach 
is to reverse the order of those two steps: (1) carry out association testing at the single marker 
level, then (2) aggregate information from neighboring markers to determine CNVs associated with 
disease/phenotype. The key idea of both approaches is that, because the data is noisy, it is virtually 
impossible to determine copy number from a single marker. However, because copy number variants 
extend over multiple markers in a sufficiently high-density array, we are able to carry out inferences 
regarding CNVs by borrowing information across neighboring markers. 

We refer to these two approaches, respectively, as variant-level testing and marker-level testing. 
In (Breheny et al., 2012), the authors explored the relative strengths and weaknesses of the two 
approaches and reached the conclusion that marker-level approaches were better able to identify 
small, common CNVs, while variant-level approaches were better able to identify large, rare CNVs. 

One serious complication with variant-level testing is that the estimated CNV boundaries from 
different individuals do not, in general, coincide. This presents a number of difficulties. Whether 
or not two individuals with partially overlapping CNVs should be in the same risk group for the 
purposes of association testing is ambiguous, and complicates both the association test itself as 
well as attempts to correct for multiple comparisons. With a large sample size, the complexity of 
overlapping CNV patterns quickly becomes daunting. Marker-level testing is an attractive alter- 
native, as the aggregation is carried out on the test results, thereby avoiding the complications of 
overlapping boundaries. 

We illustrate the idea behind marker- level testing and aggregation in Figure 1, which plots 
a negative log transformation of the p-values of the marker-level tests versus position along the 
chromosome. The details of the hypothesis tests for this example are described in Breheny et al. 
(2012), but the p- values may arise from any test for association between copy number intensity 
(Peiffer et al., 2006) and phenotype. The salient feature of the plot is the cluster of small p- values 
between 102.5 and 102.7 Mb. The presence of so many low p-values in close proximity to one 
another suggests an association between the phenotype and copy-number variation in that region. 

To locate these clusters on a genome-wide scale, Breheny et al. (2012) used a marker-level 
approach based on circular binary segmentation (Olshen et al., 2004; Venkatraman and Olshen, 
2007). Here, we take a closer look at the problem of aggregating p- values from marker level tests. We 
present two main findings. First, we develop a computationally efficient kernel-based approach for 
p-value aggregation. Second, we analyze the multiple comparison properties of this approach, and 
of p-value aggregation in general. In particular, we demonstrate that naive aggregation approaches 
assuming exchangeability of test statistics do not preserve the family- wise error rate. To solve this 
problem, we present a permutation-based approach and show that it preserves family-wise error 
rates while maintaining competitive power. 
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Figure 1: Illustration of marker- level testing. The — login (p) values from the marker- level tests are 
plotted as a function of position along the chromosome. 

2 Kernel-based aggregation 

Throughout, we will use i to index subjects and j to index markers. Let Xij denote the intensity 
measurement for subject i at marker j, let yi denote the phenotype for subject i, and let pj denote 
the p- value arising from a test of association between intensity and phenotype at marker j. Finally, 
let £j denote the location of marker j along the chromosome and J denote the total number of 
markers. 

Consider the aggregation 

where tj = f(pj) is a function of the test results for marker j and Kh(lj,lo) is a kernel that assigns a 
weight to pj depending on how far away marker j is along the chromosome from the target location 
£q. The parameter h defines the bandwidth of the kernel and thereby controls the bias- variance 
tradeoff — a larger bandwidth pools test results over a larger region and thereby decreases variance, 
but potentially introduces bias by mixing test results beyond the boundary of a CNV with those 
inside the boundary. 

Although in principle one could apply (2.1) at any arbitrary location £q, we restrict attention 
here to locations at which a marker is present and for which the bandwidth does not extent beyond 
the borders of the chromosome, thereby obtaining a finite set of aggregates {Tj}. We will consider 
transformations f(pj) such that low p- values lead to large values of tj, leading to significance testing 
based on the statistic T = maxj{Tj}. 

In this section, we describe the choice of kernel and transformation f(pj), as well as the 
issue of incorporating the direction of association for signed tests. 
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2.1 Choice of kernel 



We consider two primary choices with regard to the kernel: shape and definition of bandwidth. 
First, we may consider varying the shape of the kernel. Two common choices are the flat ("boxcar") 
kernel and the Epanechnikov kernel: 

Hat: K h (( ,.(,)-{; *\ti-*o\<h {22) 

otherwise 




Epanechnikov: K h (£j,£ ) = { 4 T V h M ^ ^ - h ( 2 .3) 

otherwise. 

Intuitively, the Epanechnikov kernel would seem more attractive, as it gives higher weight to markers 
near the target location, and diminished weight to distant markers where bias is a larger concern. 

Besides varying the shape of the kernel, we consider two definitions of bandwidth, which we refer 
to as constant width and constant marker (these concepts are sometimes referred to as "metric" 
and "adaptive" bandwidths, respectively, in the kernel smoothing literature). In the constant width 
approach, as illustrated in (2.2)-(2.3), the width h of the kernel is constant. 

In contrast, the constant marker approach expands and contracts the range of the kernel as 
needed so that there are always k markers in the positive support of the kernel. Specifically, 
hk{(-o) = | A) — £[k] |> where £^ is the location of the A:th closest marker to x. For the constant width 
approach, the number of markers given positive weight by the kernel varies depending on £q. 

The general tradeoff between the two approaches is that as we vary the target location, £q, 
constant width kernels suffer from fluctuating variance because the effective sample size is not 
constant, whereas constant marker kernels suffer from fluctuating bias because the size of the region 
over which test results are pooled is not constant. We investigate the benefits and drawbacks of 
these various kernels in Section 4. 

As a point of reference, the flat, constant marker kernel is similar to the simple moving average, 
although not exactly the same. For example, consider the following illustration. 

1 2 3 4 5 6 
• • • • • • 



Suppose h = 3. At £3, the three nearest neighbors are {pi,P2,Ps}, while at £4, the three nearest 
neighbors are {pa,P5,Pg}- Thus, combinations such as {^3,^4,^5} are not considered by the kernel 
approach. This prevents the method from aggregating test results over inappropriately disperse 
regions of the chromosome. 



2.2 Transformations 

As suggested by (2.1), directly pooling p- values is not necessarily optimal. Various transforma- 
tions of p may be able to better discriminate true associations from noise. Specifically, we consider 
the following transformations: 

p: tj = l- P j (2.4) 

Z: t j = §~ 1 (l-p j ) (2.5) 

log: tj = -]ogpj, (2.6) 
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where the text to the left of the equation is the label with which we will refer to these transformations 
in later figures and tables. The transformations are constructed in such a way that low p-values 
produce high values of tj for all three transformations. 

All three transformations have a long history in the field of combining p-values. Forming a 
combination test statistic based on the sum of log p values (or equivalently, the log of the product 
of the p- values) was first proposed by Fisher (1925) — the so-called Fisher combination test. The 
transformation (2.5) was proposed by Stouffer et al. (1949), who also studied the properties of 
sums of these normal-transformed p-values. Finally, (2.4) was proposed and studied by Edgington 
(1972). Several authors have followed up on these proposals (Littell and Folks, 1971; Hedges and 
Olkin, 1985; Feller, 1968; Good, 1955). Throughout this literature, the majority of work has focused 
on these scales — uniform, Gaussian, and logarithmic — each of which has been shown to have 
advantages and drawbacks. There is no uniformly most powerful method of combining p-values 
(Zaykin et al., 2002). 

The present application differs from the classical work described above in that the borders of 
the CNVs are not known, and thus neither is the appropriate set of p-values to combine. Conse- 
quently, we must calculate many combinations {Tj}, which are partially overlapping and therefore 
not independent, thereby requiring further methodological extensions. The implications of these 
concerns is addressed in Section 3. 



2.3 Direction of association 

Some association tests (z tests, t tests) have a direction associated with them, while others 
(x 2 tests, F tests) do not. As we will see in Section 4, it is advantageous to incorporate this 
direction into the analysis when it is available, as it diminishes noise and improves detection. We 
introduce here extensions of the transformations presented in Section 2.2 that include the direction 
of association. 

Let Sj denote the direction of association for test j. For example, in a case control study, if 
intensities were higher for cases than controls at marker j, then Sj = 1. At markers where CNV 
intensities were higher for controls than cases, Sj = — 1. The signs are arbitrary; their purpose 
is to reflect the fact that switching directions of association are inconsistent with the biological 
mechanism being studied — an underlying, latent CNV that affects both phenotype and intensity 
measures — and thus likely to be noise. 

When Sj is available, we adapt the three transformations from 2.2 as follows: 

p: t j = s j (l- Pj ) (2.7) 

Z: t 3 = ^[ l + S ^-^ ) (2.8) 

log: tj = -Sj\ogpj. (2.9) 

All of these transformations have the same effect: when pj « and Sj = 1, tj » 0; when pj « 
and Sj = —1, tj <C 0; and when pj rj 1, tj fa regardless of the value of Sj. In other words, the 
test results combine to give an aggregate value T(£q) that is large in absolute value only if the test 
results have low p- values and are consistently in the same direction. 
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3 Significance testing and FWER control 



3.1 Exchangeability 

In any analysis that involves aggregating marker-level test results, it is of interest to be able to 
quantify the significance of regions like those depicted in Figure 1. This is not trivial, however, as the 
lack of exchangeability between test results complicates matters and causes various naive approaches 
to fail. In this section, we illustrate the consequences of non-exchangeability by comparing three 
approaches to establishing the combined significance of a region with a preponderance of low p- 
values. 

One approach, suggested in Breheny et al. (2012), is to use circular binary segmentation (CBS; 
implemented in the R package DNAcopy). This method aggregates neighboring p- values by calculat- 
ing the t-test statistic comparing the intensity of a given region with that of the surrounding region. 
The significance of this test statistic is quantified by comparing it to the distribution of maximum 
test statistics obtained by permuting the {pj} values (Olshen et al., 2004; Venkatraman and Olshen, 
2007). Crucially, however, this approach assumes that the test results {pj} are exchangeable as the 
justification for permuting them. 

Alternatively, we may use the kernel methods described in Section 2.1 to aggregate the neigh- 
boring test results, thereby obtaining T max = maxj{Tj}. One approach to approximating the null 
distribution of T max is to use Monte Carlo integration based on the fact that, under the null dis- 
tribution, pj ~ Uniform(0, l).Thus, for any choice of transformation and kernel in (2.1), we may 

generate an arbitrary number of independent draws {Tm2 x }^ =1 from the null distribution func- 
tion Fo of T max , and use the empirical CDF of those draws to obtain the estimate Fq. Thus, we 
obtain a test-like screening procedure for the presence of a CNV-phenotype association based on 
p = 1 — Fo(T max ). The crucial assumption here is that, under the null, the p- values are independent. 

An alternative to the Monte Carlo approach for quantifying the significance of T max , described 
fully in Section 3.2, involves obtaining Fo by permuting the phenotype prior to aggregation of the 
marker-level tests. 

Table 1: Preservation of Type I error for three methods with nominal a = .05 in two possible 
settings for which the null hypothesis holds. The simulated genomic region contained 200 markers, 
30 of which were spanned by a CNV. The CNV was present in either 0% or 50% of the samples, 
depending on the setting. A detailed description of the simulated data is given in Section 4. 





Circular 


Kernel 


Kernel 




binary 


Monte 


Permutation 




segmentation 


Carlo 




No CNV 


0.05 


0.06 


0.06 


No Association 


0.20 


0.54 


0.06 



Consider a genomic region in which individuals may have a CNV. The goal of the analysis is 
to detect CNVs associated with a particular phenotype. Thus, the null hypothesis may hold in 
one of two ways: (1, "No CNV") no individuals with CNVs are present in the sample, or (2, "No 
association") individuals with CNVs are present in the sample, but the CNV does not change the 
probability of developing the phenotype. Table 1 demonstrates that while all three methods have 
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the proper type I error rate under null hypothesis 1 (No CNV), only the permutation approach 
preserves the correct type I error in the case where a CNV is present, but not associated with 
the disease. This is due to the fact that when a CNV is present, although it is still true that the 
marginal distribution of each pj is Uniform(0,l), the CNV introduces correlation between nearby 
markers, thereby violating the assumptions of exchangeability and independence made by the CBS 
and Monte Carlo approaches. This phenomenon is also illustrated graphically in Figure 2. 
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Figure 2: Ability of Monte Carlo and Permutation approaches to maintain family-wise error rate 
under the two null scenarios. The implementation of CBS provided by DNAcopy does not return 
p- values (only whether they fall above or below a cutoff), and thus could not be included in this 
plot. 

We make the following additional observations: (1) The CBS approach is somewhat more robust 
to the exchangeability issue than the Monte Carlo approach; i.e., its type I error rate is not as badly 
violated. (2) The data simulated here for the "no association" setting are somewhat exaggerated: 
the CNV was present in 50% of the population and the signal to noise ratio was about twice as 
high as that typically observed in real data. In more realistic settings, the violation of type I error 
rate is not nearly as severe. The results in Table 1 and Figure 2 are intended to be an illustrative 
counterexample to demonstrate that CBS and kernel Monte Carlo are not guaranteed to preserve 
the type I error in all settings. (3) Circular binary segmentation was developed for the purpose 
of detecting CNVs, not aggregating marker-level tests, and its failure to preserve the family-wise 
error rate in this setting is in no way a criticism of CBS in general. 
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3.2 Permutation approach 

We now formally define the kernel permutation method introduced in Section 3.1 and show that 
it preserves the family-wise error rate for the problem of CNV association testing. For a given set 
of test results {pj}, consider quantifying whether or not the data represent compelling evidence for 
a CNV-phenotype association using the statistic 

T max = max{T j }. (3.1) 

3 

If the tests are directional, with results {pj, Sj}, we use T max = maxj{|Tj|}. 

To obtain the null distribution of T max , we use a permutation approach, generating up to n! 
unique draws {Tm2 x }? 1 from the permutation distribution of T max . The procedure is as follows. 
At any given iteration, draw a random vector of phenotypes by permuting the original vector of 
phenotypes. Next, carry out marker level tests of association between the original CNV intensities 
and the permuted vector of phenotypes, obtaining a vector of permutation test results {pf^}- 

Finally, apply the kernel aggregation procedure described in Section 2.1 to obtain {T^} and T^L. 

We may then use the empirical CDF of these draws from the permutation distribution of T max 
to obtain the estimate Fq. Thus, we obtain a global test for the presence of a CNV-phenotype 
association based on p = 1 — Fo(T max ). By preserving the correlation structure of the original CNV 
intensities, this approach does not rely on any assumptions of exchangeability or independence 
across neighboring markers, and is thereby able to preserve the type I error rate of the testing 
procedure, unlike the other approaches described in Section 3.1. We now formally present this 
result, the proof of which appears in the Appendix. 

Theorem 1. Let Hq denote the hypothesis that the phenotype, yi, and the vector of CNV intensities, 
Xj, are independent. Then, using the permutation approach described above for any of the kernel 
aggregation approaches in Section 2.1, for any a € (0, 1), 

F(Type I error) < a. (3.2) 

It is worth pointing out that the above theorem is proven for the case in which all permutations 
of {yi} are considered. In practice, as it is usually impractical to consider all permutations, only 
a random subset of these permutations are considered. However, by the law of large numbers, the 
above conclusion still holds approximately, and may be made as precise as necessary by increasing 
the value of B, the number of permutations evaluated. For the numerical results in Section 4, we 
use B = 1,000. 

The global test above is of limited practical benefit in the sense that it does not indicate the 
location of the associated CNV. Thus, we also consider the following equivalent marker-level test: 
declare significant evidence for the presence of a CNV-phenotype association at any marker for 
which Tj > i ? _1 (l — ot). Below, we state the corollary to Theorem 1 for the kernel permutation 
method, viewed as a multiple testing procedure for each marker. 

Corollary 1. Let H$j denote the hypothesis that the phenotype, yi, and the CNV intensity at marker 
j, Xij, are independent. Then, under the global null hypothesis that yi is jointly independent of 
{Xij}, for any a G (0, 1), 

P(At least one Type I error) < a (3-3) 
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using the permutation approach described above and Tj > Fq (1 — a) as the test function for Hqj. 
In other words, the testing procedure described above controls the FWER in the weak sense at level 
a. 

It is worth noting that the procedure controls the FWER only in the weak sense — in other 
words, that it limits the probability of a false declaration of a CNV only under the global null 
hypothesis that there are no CNVs associated with the outcome. Typically in multiple testing sce- 
narios, this is undesirable and strong control is necessary. However, in the case of CNV-phenotype 
association, strong control is impractical, as it would imply that a method not only identifies CNV- 
phenotype associations, but can perfectly detect the genomic boundary of any associated CNV. 
This is an unrealistic requirement; in practice, there is no way to prevent the possibility that a 
detected CNV-phenotype association may spill over beyond the boundary of the CNV. 

4 Simulations 

4.1 Design of spike-in simulations 

In this section, we study the ability of the proposed approach to detect CNV-phenotype as- 
sociations using simulated CNVs and corresponding intensity measurements. The validity of our 
conclusions depends on how realistic the simulated data is, so we have given careful thought to 
simulating this data in as realistic a manner as possible. The spike-in design that we describe here 
is also described in Breheny et al. (2012). 

The basic design of our simulations is to use real data for the gemcitabine study described in 
Breheny et al. (2012), "spike" a signal into it, then observe the frequency with which we can recover 
that signal. We used circular binary segmentation (Olshen et al., 2004; Venkatraman and Olshen, 
2007) to estimate each sample's underlying mean intensity at every position along the chromosome, 
then subtracted the actual intensity measurement from the estimated mean to obtain a matrix of 
residuals representing measurement error. This matrix, denoted R, has 172 rows (one for each cell 
line) and 70,542 columns (one for each marker). 

We then used these residuals to simulate noise over short genomic regions in which a single 
simulated CNV is either present or absent. Letting i denote subjects and j denote markers, the 
following variables are generated: Zi, an indicator for the presence or absence of a CNV in individual 
i; Xij, the intensity measurement at marker j for individual i; and yi, the phenotype. We focus here 
on a random sampling design in which the outcome is continuous; similar results were obtained from 
a case-control sampling design in which the outcome is binary (CHECK). In the random sampling 
design, the CNV indicator, Zi, is generated from a Bernoulli distribution, where 7 = P(z« = 1) is the 
frequency of the CNV in the population; subsequently, yi\Zi is generated from a normal distribution 
whose mean depends on Z{. 

At the beginning of each simulated data set, 200 markers were randomly selected from the 
columns of R. The measurement error for simulated subject i was then drawn from the observed 
measurement errors at those markers for a randomly chosen row of R. Thus, within a simulated 
data set, all subjects are studied with respect to the same genetic markers, but the markers vary 
from data set to data set. Simulating the data in this way preserves all the features of outliers, 
heavy-tailed distributions, skewness, unequal variability among markers, and unequal variability 
among subjects that are present in real data. 
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The intensity measurements {xij} derive from these randomly sampled residuals. To the noise, 
we add a signal that depends on the presence of the simulated CNV, Z{. The added signal is equal 
to zero unless the simulated genome contains a CNV encompassing the jth marker; otherwise the 
added signal is equal to the standard deviation of the measurement error times the signal to noise 
ratio. Our simulations employed a signal-to-noise ratio of 0.8, which corresponded roughly to a 
medium-sized detectable signal based on our inspection of the gemcitabine data. Note that the 
phenotype and intensity measurement are conditionally independent given the latent copy-number 
status Zj. An illustration of the spike- in process is given in Figure 3. 
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Figure 3: Illustration of spike-in simulation design. Left: The noise, randomly drawn from among 
the estimated measurement errors for a single cell line. Middle: The spiked-in signal. Right: The 
resulting simulated data. 

For the Illumina HumanlM-Duo BeadChip, which has a median spacing of 1.5 kb between 
markers, 200 markers corresponds to simulating a 300 kb genomic region. We varied the length of 
the CNV from 10 to 50 markers, corresponding to a size range of 15 to 75 kb. For the simulations 
presented in the remainder of the article, we used a sample size of n = 1,000 and an effect size 
(change in mean divided by standard deviation) for the continuous outcome of 0.4. The simulation 
study we conducted was rather extensive. For the sake of clarity, we present representative examples 
here; the full set of simulation results is available in supplementary materials. 



4.2 Transformations 

We begin by examining the impact on power of the various transformations proposed in Sec- 
tions 2.2 and 2.3. In order to isolate the effect of transformation, we focus here on the "optimal 
bandwidth" results: the bandwidth of the kernel was chosen to match the number of markers in 
the underlying CNV. This will lead to the maximum power to detect a CNV-phenotype associa- 
tion, although this approach is clearly not feasible in practice, as the size of an underlying CNV is 
unknown. The robustness of the kernel method to choice of bandwidth is explored in Figure 5. 

The relationship between power and transformation choice is illustrated in Figure 4. The 
figure illustrates a basic trend that held consistently over many CNV frequencies and bandwidth 
choices: although the various transformations do not dramatically alter power, the normalizing 
transformation (Z) is most powerful for signed test results, while the log transformation is most 
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Figure 4: Effect of transformation choice on power. Population CNV frequency was set to 10%; 
optimal bandwidths used. 



powerful for unsigned test results. In the results that follow, unless otherwise specified, we employ 
the normalizing transformation for signed test results and the log transformation for unsigned tests. 



4.3 Choice of kernel 




Figure 5: Effect of kernel choice on power. Left: Constant-width kernel vs. constant-marker kernel. 
Right: Flat vs. Epanechnikov kernel. In both plots, population CNV frequency was 10%, test 
results were unsigned, and the log transformation was used. 

In this section, we examine two aspects of kernel choice: bandwidth implementation (constant- 
width vs. constant-marker) and kernel shape (flat vs. Epanechnikov), defined in Section 2.1. When 
all markers are equally spaced, the constant-width and constant-marker kernels are equivalent. To 
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examine the impact on power when markers are unequally spaced, we selected at random a 200- 
marker sequence from chromosome 3 of the Illumina HumanHap 550K genotyping chip and spiked 
in CNVs of various sizes. The optimal bandwidth (either in terms of the number of markers or 
base pairs spanned by the underlying CNV) was chosen for each method. 

The left side of Figure 5 presents the results of this simulation. The constant-marker approach 
is substantially more powerful. When the number of markers is not held constant, the aggregation 
measure Tj is more highly variable for some values of j than others. This causes the null distribution 
of T max to have thicker tails, which in turn increases the p- value for the observed T max , thus lowering 
power. This phenomenon manifests itself most dramatically for small bandwidths. Consequently, 
throughout the rest of this article, we employ constant-marker kernels for all analyses. 

The right side of Figure 5 presents the results of changing the kernel shape from the flat kernel 
described in (2.2) to the Epanechnikov kernel described in (2.3). We make several observations: 
(1) The shape of the kernel has little impact on power; the two lines are nearly superimposed. (2) 
The kernel approach is relatively robust to choice of bandwidth; even 5-fold differences between 
the bandwidth and optimal bandwidth do not dramatically decrease power. (3) Nevertheless, the 
optimal bandwidth does indeed occur when the number of markers included in the kernel matches 
the true number of markers spanned by the CNV. (4) The Epanechnikov kernel is slightly more 
robust to choosing a bandwidth that is too large than the flat kernel is. This makes sense, as the 
Epanechnikov kernel gives less weight to the periphery of the kernel. 

4.4 Kernel-based aggregation vs. variant-level testing 

Lastly, we compare the kernel-based aggregation approach with variant-level testing. To imple- 
ment variant-level testing, each sample was assigned a group ("variant present" or "variant absent") 
on the basis of whether a CNV was detected by CBS. A two-sample i-test was then carried out 
to test for association of the CNV with the phenotype. This variant-level approach was compared 
with kernel-based aggregation of marker- level testing for a variety of bandwidths. The results are 
presented in Figure 6. 

For rare CNVs (5% population frequency), the power of the variant-level approach and the 
aggregated marker-level approach are comparable. However, for more common CNVs, the marker- 
level approach offers a substantial increase in power. For the most part, this increase in power 
persists even when the bandwidth is misspecified. Only when the misspecification approached five- 
fold levels (selecting a 10-marker bandwidth for a 50-marker CNV) did the variant-level approach 
surpass marker-level aggregation. 

Generally speaking, these results are consistent with the findings reported in Breheny et al. 
(2012), who found that variant-level tests have optimal power relative to marker-level tests when 
CNVs are large and rare; conversely, marker-level tests have optimal power relative to variant-level 
tests when CNVs are small and common. This is understandable given the limited accuracy of 
calling algorithms for small CNVs. 

However, comparing the results in Figure 6 with the results of Breheny et al. (2012), we find 
that the kernel approach is a substantially more powerful method for aggregating marker-level tests 
than a change-point approach. Specifically, Breheny et al. found that the change-point approach 
had very low power at 5% frequency - much lower than the variant-level approach. On the other 
hand, in the same setting we find that the kernel approach is comparable to, and even slightly more 
powerful than, the variant-level approach. 
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Figure 6: Power comparison of variant-level testing (using CBS for CNV calling) with marker-level 
testing (using kernel-based aggregation). 



As discussed in Section 3.1, a change-point analysis of marker- level tests also relies on exchange- 
ability, which does not always hold. Thus, the methods developed in this article are both more 
powerful and achieve better control over the FWER than the change-point analysis described in 
Breheny et al. (2012). 

A potential drawback of the kernel approach is the need to specify a bandwidth. This makes 
the robustness of the method to bandwidth misspecification, as illustrated in Figure 6, particularly 
important because in practice it is difficult to correctly specify the bandwidth a priori. Indeed, it 
is possible that multiple CNVs associated with the outcome are present on the same chromosome 
and have different lengths. A method that is not robust to bandwidth we be incapable of detecting 
both CNVs. Generally speaking, a bandwidth of roughly 30 markers seems to provide good power 
over the range of CNV sizes that we investigate here. 



5 Discussion 

The simulation studies of Section 4 address a limited-scale version of a larger question: how do 
marker-level test aggregation and variant-level testing compare for chromosome-wide and genome- 
wide analysis? This is an important question and deserves further study. In general, multiplicity 
is a thorny issue for CNV analyses, as the true location of CNVs are unknown and can overlap 
in a number of complicated ways. The issue of how many tests to carry out and adjust for is 
a challenging question for variant-level testing and a considerable practical difficulty in analysis. 
In contrast, aggregation of marker-level results avoids this issue altogether. We have shown that 
the proposed approach is both powerful at detecting CNV associations and rigorously controls the 
FWER at a genome-wide level — two rather appealing properties. However, future work applying 
the proposed method to larger, more complex settings is necessary. 

The chief drawback of the approach is the need to carry out permutation testing. The kernel 
aggregation itself is very fast, but the need to carry out ~ 1, 000 permutation tests for each marker 
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may be highly computationally intensive (although easy to parallelize), depending the complexity 
of the marker- level test. Whether or not it is possible to speed up the approach described here 
with closed-form approximations is another important area of future work. 
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Appendix 

Proof of Theorem 1. Let V denote the set of all possible permutations of {m}, Fq the CDF of T max 
over V , and Fq" 1 its generalized inverse. Also, let </>(X, y) = 1 if T max (X, y) > — a) and 

otherwise. 

Now, note that under the null hypothesis that x, and yi are independent, 

P(X,y) = Y[P(xi, yi ) 

i 

= IJP( Xi )P(y i ) 

i 

= P(X,y*) 

for all y* G V. Thus, E o 0(X,y*) is a constant for all y* and 

E o {0(X,y)} = 1 E o^(X,y*) 

< a, 

where the term inside the expectation in the second line is less than or equal to a for all X and y 
by the construction of the test. □ 
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